HIV Viral Load Prediction
- Introduction
- Preliminary Analysis
- Data Exploration
- Covariates
- Logistic Regression Model
- Penalised Logistic Regression Model
- KNN
- Classification and Regression Trees (CART)
- XGBOOST
- GBM
- Random Forest
- BART (Bayesian Additive Regression Trees)
- SVM
- Comparative Analysis
- Cutoff Optimization
- Stacked Ensembling
- Appendix
- Source Code
Introduction
Premise: Prognostic prediction has proven empirically to be a highly effective paradigm, one that is reshaping public health, clinical medicine, and healthcare more broadly. Despite these benefits, far too little has been done to apply it to the current WHO HIV viral-load-informed care model. Currently, estimating the risk of virologic failure is typically left to the discretion of the clinician and rests heavily on the provider’s opinion. This is compounded by other complications and often delays essential interventions, particularly in resource-limited settings, which are commonly characterized by an acute shortage of healthcare workers.
Objective: To achieve the maximum health impact of the current WHO model, timely detection of potential virologic failure is critical for preventing adverse clinical trajectories such as treatment failure and immunological deterioration. This project therefore aims to assess modern techniques that can be used in HIV clinical settings with rich EHR data to anticipate and mitigate the risk of virologic failure before it manifests.
Methods: A series of statistical learning models, spanning parametric, non-parametric, ensemble, and Bayesian approaches, will be trained and evaluated on a dataset extracted from https://www.iedea.org/. IeDEA is an international research consortium established in 2006 by the National Institute of Allergy and Infectious Diseases to provide a rich resource for globally diverse HIV/AIDS data. Cross-validated metrics such as sensitivity and specificity will be used to evaluate how well each model distinguishes between low-risk and high-risk patients.
Datasource: The IeDEA Cohort Consortium hosts de-identified data on 1.7 million HIV/AIDS patients. Data are collected from seven international regions: four in Africa, and one each in the Asia-Pacific region, the Central/South America/Caribbean region, and North America. Each region has data centers that consolidate, curate, and analyze data.
Preliminary Analysis
Data Exploration
In this section we explore the data before using it to build 3 classical models for the binary classification task of predicting whether or not a patient is virally suppressed.
Import Wrangled Data
This is a one-row-per-patient dataset created in the previous section, with the features listed below. Note that not all of these features will be used for prediction; we will use only baseline features.
Covariates
| type | variable | missing | complete | n | n_unique | top_counts | ordered | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| factor | abdominal_pexam | 0 | 21260 | 21260 | 3 | Nor: 13417, Ukn: 7666, Abn: 177, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | arv_change_reason | 0 | 21260 | 21260 | 18 | N/A: 20560, 829: 182, 193: 145, 102: 114 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | arv_elgibility_reason | 0 | 21260 | 21260 | 19 | 562: 14140, 950: 3879, 120: 671, 177: 600 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | bmi_status | 0 | 21260 | 21260 | 5 | Nor: 11616, Und: 4251, Ove: 2646, Obe: 1603 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | cardiac_pexam | 0 | 21260 | 21260 | 3 | Nor: 15670, Ukn: 5554, Abn: 36, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | contraceptive | 0 | 21260 | 21260 | 25 | 110: 11127, 190: 6475, 527: 950, 907: 634 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | cryptococcus_tx | 0 | 21260 | 21260 | 2 | 110: 21137, 747: 123, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | cur_pcp_prophylaxis | 0 | 21260 | 21260 | 3 | 916: 15092, 110: 6025, 92: 143, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | cxr_code_labs | 0 | 21260 | 21260 | 3 | Ukn: 19998, Nor: 788, Abn: 474, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | drug_toxicity_cause | 0 | 21260 | 21260 | 13 | 110: 21121, 512: 45, 562: 36, 3: 12 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | drug_toxicity_effects | 0 | 21260 | 21260 | 26 | 110: 21103, 512: 52, 877: 16, 562: 14 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | drug_toxicity_severity | 0 | 21260 | 21260 | 4 | 110: 21132, 174: 63, 174: 40, 174: 25 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | extremies_pexam | 0 | 21260 | 21260 | 3 | Nor: 13488, Ukn: 7643, Abn: 129, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | family_tx_support | 0 | 21260 | 21260 | 26 | 110: 18341, 140: 707, 727: 580, 727: 567 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | first_arv_adherence | 0 | 21260 | 21260 | 4 | Unk: 19026, GOO: 2151, POO: 50, FAI: 33 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | first_arv_meds | 0 | 21260 | 21260 | 84 | 0: 10767, 696: 7091, 106: 938, 106: 767 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | first_location | 0 | 21260 | 21260 | 82 | loc: 2348, loc: 2228, loc: 1835, loc: 1147 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | general_pexam | 0 | 21260 | 21260 | 3 | Ukn: 16184, Nor: 3330, Abn: 1746, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | health_cover | 0 | 21260 | 21260 | 7 | 110: 16230, NHI: 2725, 106: 1342, 562: 943 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | heent_pexam | 0 | 21260 | 21260 | 3 | Nor: 13349, Ukn: 7648, Abn: 263, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | hospitalization_loc | 0 | 21260 | 21260 | 4 | N/A: 21114, 127: 71, 127: 51, 562: 24 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | hospitalization_rsn | 0 | 21260 | 21260 | 103 | N/A: 20889, 123: 80, 43: 29, 197: 23 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | immunization_status | 0 | 21260 | 21260 | 5 | N/A: 21117, 106: 79, 562: 53, 106: 9 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | last_arv_adherence | 0 | 21260 | 21260 | 4 | Unk: 10823, GOO: 10195, POO: 122, FAI: 120 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | last_arv_meds | 0 | 21260 | 21260 | 96 | 696: 11977, 0: 6461, 646: 587, 631: 431 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | last_location | 0 | 21260 | 21260 | 80 | loc: 2333, loc: 2244, loc: 1998, loc: 1133 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | lymph_nodes_pexam | 0 | 21260 | 21260 | 3 | Nor: 13210, Ukn: 7715, Abn: 335, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | musculoskeletal_pexam | 0 | 21260 | 21260 | 3 | Nor: 13511, Ukn: 7681, Abn: 68, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | neurologic_pexam | 0 | 21260 | 21260 | 3 | Nor: 13355, Ukn: 7868, Abn: 37, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | not_onart_reason | 0 | 21260 | 21260 | 5 | N/A: 20375, 143: 584, 562: 184, 548: 80 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | nutrition_status | 0 | 21260 | 21260 | 6 | N/A: 17371, 111: 3235, 947: 278, 689: 170 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | pcp_change_reason | 0 | 21260 | 21260 | 5 | N/A: 21159, 102: 69, 562: 29, 704: 2 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | pcp_prophy_adherence | 0 | 21260 | 21260 | 9 | 634: 9927, N/A: 6865, 116: 4157, 665: 135 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | phdp_referral | 0 | 21260 | 21260 | 10 | 110: 11397, 548: 7617, 117: 1321, 830: 504 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | point_of_hiv_daignosis | 0 | 21260 | 21260 | 13 | 562: 14658, 217: 3847, 204: 1508, 562: 282 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | poor_adherence_rsn | 0 | 21260 | 21260 | 18 | 110: 20251, 164: 218, 610: 201, 562: 150 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | psychiatric_pexam | 0 | 21260 | 21260 | 3 | Nor: 13426, Ukn: 7779, Abn: 55, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | pulse_status | 0 | 21260 | 21260 | 3 | Nor: 17204, hig: 3032, low: 1024, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | referral_ordered | 0 | 21260 | 21260 | 30 | 110: 12109, 548: 2829, 548: 2278, 158: 863 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | respiratory_pexam | 0 | 21260 | 21260 | 3 | Nor: 13088, Ukn: 7598, Abn: 574, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | skin_pexam | 0 | 21260 | 21260 | 3 | Nor: 12526, Ukn: 7653, Abn: 1081, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | sti_symptoms | 0 | 21260 | 21260 | 20 | 110: 20069, 599: 204, 620: 171, 623: 133 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | sulf_peni_other_reactions | 0 | 21260 | 21260 | 8 | 110: 21171, 512: 38, 562: 23, 879: 15 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | tb_assmt_status | 0 | 21260 | 21260 | 4 | 110: 20033, 697: 746, 617: 415, 111: 66 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | tb_prop_change_rsn | 0 | 21260 | 21260 | 6 | N/A: 20307, 126: 879, 562: 33, 102: 31 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | tb_prophy_regimen | 0 | 21260 | 21260 | 2 | 110: 17017, 656: 4243, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | tb_symptoms | 0 | 21260 | 21260 | 15 | 110: 19443, 617: 908, 136: 235, 596: 199 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | tb_tx_change_rsn | 0 | 21260 | 21260 | 4 | N/A: 21114, 126: 119, 562: 16, 102: 11 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | tb_tx_phase | 0 | 21260 | 21260 | 5 | 110: 20730, 619: 373, 619: 154, 619: 2 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | tb_tx_regimen | 0 | 21260 | 21260 | 14 | 110: 20048, 113: 601, 106: 517, 119: 68 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | tb_tx_restart_rsn | 0 | 21260 | 21260 | 6 | N/A: 21069, 697: 166, 697: 15, 698: 8 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | toxic_drug | 0 | 21260 | 21260 | 12 | 110: 21200, 633: 21, 916: 14, 656: 8 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | urogenital_pexam | 0 | 21260 | 21260 | 3 | Nor: 11376, Ukn: 5609, Abn: 4275, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | vl_1_date | 0 | 21260 | 21260 | 733 | emp: 19744, 201: 9, 201: 9, 201: 8 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
| integer | adherence_changes | 0 | 21260 | 21260 | NA | NA | NA | 0.051 | 0.35 | 0 | 0 | 0 | 0 | 9 | ▇▁▁▁▁▁▁▁ |
| integer | alcohol_consumer | 0 | 21260 | 21260 | NA | NA | NA | 0.14 | 0.34 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | arv_lines_changed | 0 | 21260 | 21260 | NA | NA | NA | 0.007 | 0.1 | 0 | 0 | 0 | 0 | 5 | ▇▁▁▁▁▁▁▁ |
| integer | arv_meds_changed | 0 | 21260 | 21260 | NA | NA | NA | 0.15 | 0.48 | 0 | 0 | 0 | 0 | 7 | ▇▁▁▁▁▁▁▁ |
| integer | changed_location | 0 | 21260 | 21260 | NA | NA | NA | 0.2 | 0.76 | 0 | 0 | 0 | 0 | 11 | ▇▁▁▁▁▁▁▁ |
| integer | changed_who_stages | 0 | 21260 | 21260 | NA | NA | NA | 0.1 | 0.47 | 0 | 0 | 0 | 0 | 7 | ▇▁▁▁▁▁▁▁ |
| integer | cig_smoker | 0 | 21260 | 21260 | NA | NA | NA | 0.051 | 0.22 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | clinical_problem_rptd | 0 | 21260 | 21260 | NA | NA | NA | 0.13 | 0.33 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | crag_labs | 0 | 21260 | 21260 | NA | NA | NA | 664 | 0 | 664 | 664 | 664 | 664 | 664 | ▁▁▁▇▁▁▁▁ |
| integer | cur_on_other_meds | 0 | 21260 | 21260 | NA | NA | NA | 0.32 | 0.47 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▁▁▁▃ |
| integer | days_b4_next_vl | 0 | 21260 | 21260 | NA | NA | NA | 70.06 | 2501.88 | 0 | 0 | 1 | 57 | 364401 | ▇▁▁▁▁▁▁▁ |
| integer | days_btwn_apptmts | 0 | 21260 | 21260 | NA | NA | NA | 0.18 | 24.68 | 0 | 0 | 0 | 0 | 3597 | ▇▁▁▁▁▁▁▁ |
| integer | facility_volume | 0 | 21260 | 21260 | NA | NA | NA | 7402.74 | 4519.18 | 1 | 4805 | 6211 | 9833 | 16564 | ▃▁▇▃▅▁▂▂ |
| integer | first_age | 0 | 21260 | 21260 | NA | NA | NA | 38.12 | 11.02 | 18 | 30 | 36 | 44 | 103 | ▃▇▅▂▁▁▁▁ |
| integer | first_arv_line | 0 | 21260 | 21260 | NA | NA | NA | 0.47 | 0.57 | 0 | 0 | 0 | 1 | 3 | ▇▁▆▁▁▁▁▁ |
| integer | first_days_pregnant | 0 | 21260 | 21260 | NA | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ▁▁▁▇▁▁▁▁ |
| integer | first_pcs | 0 | 21260 | 21260 | NA | NA | NA | 6076.1 | 337.27 | 1286 | 6101 | 6101 | 6101 | 6101 | ▁▁▁▁▁▁▁▇ |
| integer | first_who_stage | 0 | 21260 | 21260 | NA | NA | NA | 0.62 | 1 | 0 | 0 | 0 | 1 | 4 | ▇▂▁▁▁▁▁▁ |
| integer | has_abnormal_oxy_sat | 0 | 21260 | 21260 | NA | NA | NA | 0.022 | 0.15 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_been_hospitalized | 0 | 21260 | 21260 | NA | NA | NA | 0.017 | 0.13 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_changed_pcp | 0 | 21260 | 21260 | NA | NA | NA | 0.0048 | 0.069 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_changed_tb_prop | 0 | 21260 | 21260 | NA | NA | NA | 0.045 | 0.21 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_changed_tb_tx | 0 | 21260 | 21260 | NA | NA | NA | 0.0069 | 0.083 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_drug_tox_efcts | 0 | 21260 | 21260 | NA | NA | NA | 0.0074 | 0.086 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_fever | 0 | 21260 | 21260 | NA | NA | NA | 0.015 | 0.12 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_heptis_b | 0 | 21260 | 21260 | NA | NA | NA | 0.00028 | 0.017 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_high_bp | 0 | 21260 | 21260 | NA | NA | NA | 0.36 | 0.48 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▁▁▁▅ |
| integer | has_low_bp | 0 | 21260 | 21260 | NA | NA | NA | 0.024 | 0.15 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_phdp_referral | 0 | 21260 | 21260 | NA | NA | NA | 0.46 | 0.5 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▁▁▁▇ |
| integer | has_referral_order | 0 | 21260 | 21260 | NA | NA | NA | 0.43 | 0.5 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▁▁▁▆ |
| integer | has_restarted_tb_tx | 0 | 21260 | 21260 | NA | NA | NA | 0.009 | 0.094 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_sti_symptoms | 0 | 21260 | 21260 | NA | NA | NA | 0.056 | 0.23 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_sulf_peni_rxns | 0 | 21260 | 21260 | NA | NA | NA | 0.0042 | 0.065 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_tb_symptoms | 0 | 21260 | 21260 | NA | NA | NA | 0.085 | 0.28 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_toxic_drug | 0 | 21260 | 21260 | NA | NA | NA | 0.0028 | 0.053 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | has_used_contraceptive | 0 | 21260 | 21260 | NA | NA | NA | 0.07 | 0.25 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | having_drug_toxicity | 0 | 21260 | 21260 | NA | NA | NA | 0.0095 | 0.097 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | hospitalized_recently | 0 | 21260 | 21260 | NA | NA | NA | 0.56 | 0.5 | 0 | 0 | 1 | 1 | 1 | ▆▁▁▁▁▁▁▇ |
| integer | is_abdominal_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.0083 | 0.091 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_breastfeeding | 0 | 21260 | 21260 | NA | NA | NA | 0.014 | 0.12 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_cardiac_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.0017 | 0.041 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_cxr_code_labs | 0 | 21260 | 21260 | NA | NA | NA | 0.022 | 0.15 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_extremies_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.0061 | 0.078 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_general_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.082 | 0.27 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_heent_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.012 | 0.11 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_lymph_nodes_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.016 | 0.12 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_male | 0 | 21260 | 21260 | NA | NA | NA | 0.32 | 0.47 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▁▁▁▃ |
| integer | is_musculoskeletal_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.0032 | 0.056 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_neurologic_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.0017 | 0.042 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_on_contraceptive | 0 | 21260 | 21260 | NA | NA | NA | 0.48 | 0.5 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▁▁▁▇ |
| integer | is_on_cryptococcus_tx | 0 | 21260 | 21260 | NA | NA | NA | 0.0058 | 0.076 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_on_health_cover | 0 | 21260 | 21260 | NA | NA | NA | 0.24 | 0.43 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▂ |
| integer | is_on_tb_prophy_regimen | 0 | 21260 | 21260 | NA | NA | NA | 0.2 | 0.4 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▂ |
| integer | is_psychiatric_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.0026 | 0.051 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_respiratory_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.027 | 0.16 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_skin_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.051 | 0.22 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_status_disclosed | 0 | 21260 | 21260 | NA | NA | NA | 0.1 | 0.3 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_symptomatic | 0 | 21260 | 21260 | NA | NA | NA | 0.018 | 0.13 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | is_underweight | 0 | 21260 | 21260 | NA | NA | NA | 0.2 | 0.4 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▂ |
| integer | is_urogenital_pexam | 0 | 21260 | 21260 | NA | NA | NA | 0.2 | 0.4 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▂ |
| integer | last_age | 0 | 21260 | 21260 | NA | NA | NA | 38.12 | 11.02 | 18 | 30 | 36 | 44 | 103 | ▃▇▅▂▁▁▁▁ |
| integer | last_arv_line | 0 | 21260 | 21260 | NA | NA | NA | 0.71 | 0.55 | 0 | 0 | 1 | 1 | 3 | ▃▁▇▁▁▁▁▁ |
| integer | last_days_pregnant | 0 | 21260 | 21260 | NA | NA | NA | 6.02 | 25.71 | 0 | 0 | 0 | 0 | 266 | ▇▁▁▁▁▁▁▁ |
| integer | last_pcs | 0 | 21260 | 21260 | NA | NA | NA | 6003.92 | 670.13 | 1286 | 6101 | 6101 | 6101 | 9068 | ▁▁▁▁▇▁▁▁ |
| integer | last_who_stage | 0 | 21260 | 21260 | NA | NA | NA | 1.01 | 1.15 | 0 | 0 | 1 | 2 | 4 | ▇▆▁▂▁▂▁▁ |
| integer | max_days_btwn_apptmts | 0 | 21260 | 21260 | NA | NA | NA | 58.97 | 112.34 | 0 | 0 | 30 | 73.25 | 3597 | ▇▁▁▁▁▁▁▁ |
| integer | max_days_on_arvs | 0 | 21260 | 21260 | NA | NA | NA | 148.76 | 355.1 | -85 | 0 | 15 | 188 | 5480 | ▇▁▁▁▁▁▁▁ |
| integer | max_days_on_treatment | 0 | 21260 | 21260 | NA | NA | NA | 155.52 | 267.08 | 0 | 0 | 68 | 210 | 16398 | ▇▁▁▁▁▁▁▁ |
| integer | max_days_pregnant | 0 | 21260 | 21260 | NA | NA | NA | 7.58 | 29.15 | 0 | 0 | 0 | 0 | 267 | ▇▁▁▁▁▁▁▁ |
| integer | max_tb_prop_days | 0 | 21260 | 21260 | NA | NA | NA | 15.06 | 51.73 | 0 | 0 | 0 | 0 | 1184 | ▇▁▁▁▁▁▁▁ |
| integer | max_tb_tx_days | 0 | 21260 | 21260 | NA | NA | NA | 11.47 | 69.98 | 0 | 0 | 0 | 0 | 3762 | ▇▁▁▁▁▁▁▁ |
| integer | min_days_btwn_apptmts | 0 | 21260 | 21260 | NA | NA | NA | 0.006 | 0.59 | 0 | 0 | 0 | 0 | 77 | ▇▁▁▁▁▁▁▁ |
| integer | min_days_on_arvs | 0 | 21260 | 21260 | NA | NA | NA | 46.87 | 316.09 | -85 | 0 | 0 | 1 | 5480 | ▇▁▁▁▁▁▁▁ |
| integer | min_days_on_treatment | 0 | 21260 | 21260 | NA | NA | NA | 21.25 | 193.7 | 0 | 0 | 0 | 0 | 16259 | ▇▁▁▁▁▁▁▁ |
| integer | min_days_pregnant | 0 | 21260 | 21260 | NA | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ▁▁▁▇▁▁▁▁ |
| integer | min_tb_prop_days | 0 | 21260 | 21260 | NA | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ▁▁▁▇▁▁▁▁ |
| integer | min_tb_tx_days | 0 | 21260 | 21260 | NA | NA | NA | 0.48 | 9.78 | 0 | 0 | 0 | 0 | 1101 | ▇▁▁▁▁▁▁▁ |
| integer | needs_fam_tx_support | 0 | 21260 | 21260 | NA | NA | NA | 0.14 | 0.34 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| integer | num_bad_adherence | 0 | 21260 | 21260 | NA | NA | NA | 0.047 | 0.29 | 0 | 0 | 0 | 0 | 9 | ▇▁▁▁▁▁▁▁ |
| integer | num_days_in_care | 0 | 21260 | 21260 | NA | NA | NA | 155.52 | 267.08 | 0 | 0 | 68 | 210 | 16398 | ▇▁▁▁▁▁▁▁ |
| integer | num_days_on_arvs | 0 | 21260 | 21260 | NA | NA | NA | 148.76 | 355.1 | -85 | 0 | 15 | 188 | 5480 | ▇▁▁▁▁▁▁▁ |
| integer | num_days_on_tb_meds | 0 | 21260 | 21260 | NA | NA | NA | 11.47 | 69.98 | 0 | 0 | 0 | 0 | 3762 | ▇▁▁▁▁▁▁▁ |
| integer | num_days_on_tb_prop | 0 | 21260 | 21260 | NA | NA | NA | 15.06 | 51.73 | 0 | 0 | 0 | 0 | 1184 | ▇▁▁▁▁▁▁▁ |
| integer | num_defaulted_apptmt | 0 | 21260 | 21260 | NA | NA | NA | 0.21 | 0.53 | 0 | 0 | 0 | 0 | 7 | ▇▁▁▁▁▁▁▁ |
| integer | num_encounters | 0 | 21260 | 21260 | NA | NA | NA | 4.04 | 3.62 | 1 | 1 | 3 | 6 | 29 | ▇▃▁▁▁▁▁▁ |
| integer | num_encs_b4_vl1 | 0 | 21260 | 21260 | NA | NA | NA | 3.97 | 3.68 | 0 | 1 | 3 | 6 | 29 | ▇▃▂▁▁▁▁▁ |
| integer | num_pcs_changes | 0 | 21260 | 21260 | NA | NA | NA | 0.029 | 0.21 | 0 | 0 | 0 | 0 | 4 | ▇▁▁▁▁▁▁▁ |
| integer | other_meds_allergy | 0 | 21260 | 21260 | NA | NA | NA | 0.57 | 0.5 | 0 | 0 | 1 | 1 | 1 | ▆▁▁▁▁▁▁▇ |
| integer | penicillin_allergy | 0 | 21260 | 21260 | NA | NA | NA | 0.67 | 0.47 | 0 | 0 | 1 | 1 | 1 | ▃▁▁▁▁▁▁▇ |
| integer | person_id | 0 | 21260 | 21260 | NA | NA | NA | 784602.92 | 87672.43 | 55952 | 767575.75 | 794491.5 | 827641.25 | 861938 | ▁▁▁▁▁▁▂▇ |
| integer | sulfa_allergy | 0 | 21260 | 21260 | NA | NA | NA | 0.67 | 0.47 | 0 | 0 | 1 | 1 | 1 | ▃▁▁▁▁▁▁▇ |
| integer | suppressed | 0 | 21260 | 21260 | NA | NA | NA | 0.29 | 0.45 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▁▁▁▃ |
| integer | tb_afb_labs | 0 | 21260 | 21260 | NA | NA | NA | 664.99 | 39.17 | 664 | 664 | 664 | 664 | 2303 | ▇▁▁▁▁▁▁▁ |
| integer | tb_culture_labs | 0 | 21260 | 21260 | NA | NA | NA | 664 | 0.27 | 664 | 664 | 664 | 664 | 703 | ▇▁▁▁▁▁▁▁ |
| integer | tb_gene_xp_labs | 0 | 21260 | 21260 | NA | NA | NA | 664.05 | 1.34 | 664 | 664 | 664 | 664 | 703 | ▇▁▁▁▁▁▁▁ |
| integer | vdrl_labs | 0 | 21260 | 21260 | NA | NA | NA | 666.05 | 33.87 | 664 | 664 | 664 | 664 | 1229 | ▇▁▁▁▁▁▁▁ |
| integer | vl_count_1 | 13536 | 7724 | 21260 | NA | NA | NA | 26920.51 | 2e+05 | 0 | 0 | 0 | 581.25 | 8e+06 | ▇▁▁▁▁▁▁▁ |
| integer | vl_count_2 | 16645 | 4615 | 21260 | NA | NA | NA | 17165.74 | 175479.66 | 0 | 0 | 0 | 472 | 1e+07 | ▇▁▁▁▁▁▁▁ |
| integer | vl_count_3 | 18850 | 2410 | 21260 | NA | NA | NA | 11132.73 | 88214.64 | 0 | 0 | 0 | 549.5 | 3266217 | ▇▁▁▁▁▁▁▁ |
| integer | vl_count_4 | 20288 | 972 | 21260 | NA | NA | NA | 12718.32 | 67014.09 | 0 | 0 | 55 | 652.5 | 987760 | ▇▁▁▁▁▁▁▁ |
| integer | vl_count_5 | 21016 | 244 | 21260 | NA | NA | NA | 20387.95 | 82791.9 | 0 | 0 | 136 | 2144.5 | 869508 | ▇▁▁▁▁▁▁▁ |
| integer | vl_count_6 | 21199 | 61 | 21260 | NA | NA | NA | 24954.18 | 67061.61 | 0 | 0 | 780 | 12325 | 362511 | ▇▁▁▁▁▁▁▁ |
| integer | vl_count_7 | 21243 | 17 | 21260 | NA | NA | NA | 10996.06 | 36603.48 | 0 | 0 | 0 | 747 | 150360 | ▇▁▁▁▁▁▁▁ |
| integer | vl_count_8 | 21260 | 0 | 21260 | NA | NA | NA | NaN | NA | NA | NA | NA | NA | NA | |
| numeric | avg_cd4_perc | 21251 | 9 | 21260 | NA | NA | NA | 35.78 | 30.59 | 2 | 13 | 26 | 49 | 98 | ▇▅▁▅▂▁▁▂ |
| numeric | avg_days_btwn_apptmts | 0 | 21260 | 21260 | NA | NA | NA | 22.82 | 33.33 | 0 | 0 | 17.25 | 33 | 928 | ▇▁▁▁▁▁▁▁ |
| numeric | avg_days_on_arvs | 0 | 21260 | 21260 | NA | NA | NA | 86.04 | 322.92 | -85 | 0 | 7.5 | 72.63 | 5480 | ▇▁▁▁▁▁▁▁ |
| numeric | avg_days_on_treatment | 0 | 21260 | 21260 | NA | NA | NA | 80.21 | 210.05 | 0 | 0 | 33.25 | 92.23 | 16328.5 | ▇▁▁▁▁▁▁▁ |
| numeric | avg_dbp | 682 | 20578 | 21260 | NA | NA | NA | 70.66 | 9.79 | 0 | 64.25 | 70 | 76 | 156 | ▁▁▁▇▂▁▁▁ |
| numeric | avg_oxy_sat | 3708 | 17552 | 21260 | NA | NA | NA | 96.5 | 4.35 | 0 | 96 | 97.33 | 98 | 100 | ▁▁▁▁▁▁▁▇ |
| numeric | avg_pulse | 659 | 20601 | 21260 | NA | NA | NA | 90.25 | 17.53 | 0 | 78.5 | 88 | 99.67 | 198 | ▁▁▂▇▃▁▁▁ |
| numeric | avg_sbp | 678 | 20582 | 21260 | NA | NA | NA | 114.02 | 14.66 | 0 | 105 | 112.5 | 120.59 | 243 | ▁▁▁▇▂▁▁▁ |
| numeric | avg_temp | 1833 | 19427 | 21260 | NA | NA | NA | 36.46 | 0.66 | 26 | 36.1 | 36.47 | 36.8 | 42 | ▁▁▁▁▂▇▁▁ |
| numeric | avg_weight | 0 | 21260 | 21260 | NA | NA | NA | 59.43 | 12.28 | 0 | 51.67 | 58 | 65.65 | 181 | ▁▁▇▂▁▁▁▁ |
| numeric | bmi | 0 | 21260 | 21260 | NA | NA | NA | 28.46 | 133.23 | 0 | 19.02 | 21.26 | 24.13 | 7098.34 | ▇▁▁▁▁▁▁▁ |
| numeric | first_cd4_perc | 21253 | 7 | 21260 | NA | NA | NA | 29.14 | 22.58 | 2 | 10.5 | 26 | 48 | 59 | ▇▃▁▃▁▁▇▃ |
| numeric | first_dbp | 1393 | 19867 | 21260 | NA | NA | NA | 70.67 | 11.77 | 0 | 60 | 70 | 79 | 161 | ▁▁▃▇▂▁▁▁ |
| numeric | first_oxy_sat | 4625 | 16635 | 21260 | NA | NA | NA | 96.25 | 5.26 | 0 | 96 | 97 | 98 | 100 | ▁▁▁▁▁▁▁▇ |
| numeric | first_pulse | 1254 | 20006 | 21260 | NA | NA | NA | 91.19 | 20.78 | 0 | 78 | 88 | 103 | 216 | ▁▁▆▇▂▁▁▁ |
| numeric | first_sbp | 1382 | 19878 | 21260 | NA | NA | NA | 113.74 | 17.22 | 0 | 100 | 111 | 121 | 243 | ▁▁▁▇▂▁▁▁ |
| numeric | first_temp | 2987 | 18273 | 21260 | NA | NA | NA | 36.55 | 0.82 | 25.8 | 36.1 | 36.6 | 37 | 40.7 | ▁▁▁▁▁▇▃▁ |
| numeric | first_weight | 71 | 21189 | 21260 | NA | NA | NA | 58.98 | 12.92 | 0 | 51 | 58 | 65 | 181 | ▁▁▇▂▁▁▁▁ |
| numeric | height | 0 | 21260 | 21260 | NA | NA | NA | 163.66 | 16.48 | 10 | 159 | 165 | 171 | 260 | ▁▁▁▁▇▆▁▁ |
| numeric | last_cd4_perc | 21251 | 9 | 21260 | NA | NA | NA | 35.78 | 30.59 | 2 | 13 | 26 | 49 | 98 | ▇▅▁▅▂▁▁▂ |
| numeric | last_dbp | 682 | 20578 | 21260 | NA | NA | NA | 70.84 | 11.54 | 0 | 61 | 70 | 79 | 156 | ▁▁▁▇▃▁▁▁ |
| numeric | last_oxy_sat | 3708 | 17552 | 21260 | NA | NA | NA | 96.55 | 5.49 | 0 | 96 | 98 | 98 | 100 | ▁▁▁▁▁▁▁▇ |
| numeric | last_pulse | 659 | 20601 | 21260 | NA | NA | NA | 89.21 | 19.69 | 0 | 76 | 87 | 100 | 214 | ▁▁▆▇▂▁▁▁ |
| numeric | last_sbp | 678 | 20582 | 21260 | NA | NA | NA | 114.5 | 16.78 | 0 | 103 | 113 | 122 | 243 | ▁▁▁▇▃▁▁▁ |
| numeric | last_temp | 1833 | 19427 | 21260 | NA | NA | NA | 36.41 | 0.78 | 26 | 36 | 36.4 | 36.8 | 42.5 | ▁▁▁▁▆▇▁▁ |
| numeric | last_weight | 0 | 21260 | 21260 | NA | NA | NA | 59.88 | 12.93 | 0 | 52 | 59 | 66 | 181 | ▁▁▇▂▁▁▁▁ |
| numeric | max_cd4_perc | 21251 | 9 | 21260 | NA | NA | NA | 35.78 | 30.59 | 2 | 13 | 26 | 49 | 98 | ▇▅▁▅▂▁▁▂ |
| numeric | max_dbp | 682 | 20578 | 21260 | NA | NA | NA | 75.57 | 11 | 0 | 70 | 76 | 81 | 156 | ▁▁▁▇▇▁▁▁ |
| numeric | max_oxy_sat | 3708 | 17552 | 21260 | NA | NA | NA | 97.4 | 3.99 | 0 | 97 | 98 | 99 | 100 | ▁▁▁▁▁▁▁▇ |
| numeric | max_pulse | 659 | 20601 | 21260 | NA | NA | NA | 91.92 | 17.25 | 0 | 82 | 92 | 98 | 204 | ▁▁▂▇▁▁▁▁ |
| numeric | max_sbp | 678 | 20582 | 21260 | NA | NA | NA | 117.11 | 20.33 | 0 | 100 | 120 | 130 | 243 | ▁▁▂▇▅▁▁▁ |
| numeric | max_temp | 1833 | 19427 | 21260 | NA | NA | NA | 36.81 | 0.78 | 26 | 36.4 | 36.8 | 37.2 | 42.5 | ▁▁▁▁▂▇▁▁ |
| numeric | max_weight | 0 | 21260 | 21260 | NA | NA | NA | 61.34 | 12.56 | 0 | 53 | 60 | 68 | 181 | ▁▁▇▃▁▁▁▁ |
| numeric | min_cd4_perc | 21251 | 9 | 21260 | NA | NA | NA | 35.78 | 30.59 | 2 | 13 | 26 | 49 | 98 | ▇▅▁▅▂▁▁▂ |
| numeric | min_dbp | 682 | 20578 | 21260 | NA | NA | NA | 66.74 | 13.1 | 0 | 60 | 64 | 71 | 161 | ▁▁▇▇▁▁▁▁ |
| numeric | min_oxy_sat | 3708 | 17552 | 21260 | NA | NA | NA | 95.37 | 7.65 | 0 | 95 | 97 | 98 | 100 | ▁▁▁▁▁▁▁▇ |
| numeric | min_pulse | 659 | 20601 | 21260 | NA | NA | NA | 90.32 | 21.25 | 0 | 74 | 92 | 104 | 214 | ▁▁▆▇▃▁▁▁ |
| numeric | min_sbp | 678 | 20582 | 21260 | NA | NA | NA | 109.49 | 14.99 | 0 | 100 | 108 | 117 | 243 | ▁▁▁▇▁▁▁▁ |
| numeric | min_temp | 1833 | 19427 | 21260 | NA | NA | NA | 36.07 | 0.89 | 25.8 | 35.7 | 36.1 | 36.5 | 42 | ▁▁▁▁▃▇▁▁ |
| numeric | min_weight | 0 | 21260 | 21260 | NA | NA | NA | 57.84 | 14.83 | 0 | 50 | 56 | 64 | 187 | ▁▂▇▁▁▁▁▁ |
| numeric | prop_bad_adherence | 0 | 21260 | 21260 | NA | NA | NA | 0.0091 | 0.064 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| numeric | prop_days_on_arvs | 0 | 21260 | 21260 | NA | NA | NA | -Inf | NaN | -Inf | 0 | 0 | 0.048 | 1 | ▇▁▁▁▁▁▁▁ |
| numeric | prop_days_on_tb_meds | 0 | 21260 | 21260 | NA | NA | NA | 0.025 | 0.14 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| numeric | prop_days_on_tb_prop | 0 | 21260 | 21260 | NA | NA | NA | 0.053 | 0.17 | 0 | 0 | 0 | 0 | 0.99 | ▇▁▁▁▁▁▁▁ |
| numeric | prop_defaulted_apptmts | 0 | 21260 | 21260 | NA | NA | NA | 0.039 | 0.11 | 0 | 0 | 0 | 0 | 0.8 | ▇▁▁▁▁▁▁▁ |
| numeric | vl_count_1_log | 13536 | 7724 | 21260 | NA | NA | NA | -2.72 | 9.54 | -11.51 | -11.51 | -11.51 | 6.37 | 15.89 | ▇▁▁▁▂▃▂▁ |
| numeric | vl2_suppression | 16645 | 4615 | 21260 | NA | NA | NA | 0.8 | 0.4 | 0 | 1 | 1 | 1 | 1 | ▂▁▁▁▁▁▁▇ |
| numeric | vl3_suppression | 18850 | 2410 | 21260 | NA | NA | NA | 0.79 | 0.4 | 0 | 1 | 1 | 1 | 1 | ▂▁▁▁▁▁▁▇ |
| numeric | vl4_suppression | 20288 | 972 | 21260 | NA | NA | NA | 0.79 | 0.41 | 0 | 1 | 1 | 1 | 1 | ▂▁▁▁▁▁▁▇ |
| numeric | vl5_suppression | 21016 | 244 | 21260 | NA | NA | NA | 0.69 | 0.46 | 0 | 0 | 1 | 1 | 1 | ▃▁▁▁▁▁▁▇ |
Training and Validation Sets
The dataset was split into a training set and an out-of-sample validation set in a 0.9:0.1 ratio, i.e. 4615 (90%) and 512 (10%), respectively.
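The split described above can be sketched as a shuffled hold-out. This is a minimal illustration in Python rather than the code used for this report: the function name `train_validation_split`, the seed, and the integer stand-ins for patient identifiers are all assumptions.

```python
import random


def train_validation_split(ids, validation_fraction=0.1, seed=42):
    """Shuffle the ids and hold out a fraction for out-of-sample validation."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical identifiers: 5127 = 4615 + 512, the sizes quoted in the text.
patient_ids = range(5127)
train_ids, validation_ids = train_validation_split(patient_ids)
print(len(train_ids), len(validation_ids))
```

Splitting on patient identifiers, rather than on rows, keeps each patient entirely in one set, which matters little for a one-row-per-patient table but avoids leakage if the same idea is applied to encounter-level data.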
Logistic Regression Model
We used logistic regression, which estimates the probability of an outcome. The outcome is coded as a binary variable, with a value of 1 representing suppression and a value of 0 representing treatment failure.
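To make the probability estimate concrete, the sketch below shows how a logistic model turns a linear predictor into a probability and then into a 0/1 label. It is an illustration only: the intercept, the two coefficients, the example patient, and the 0.5 cutoff are hypothetical values, not the fitted model.

```python
import math


def logistic(eta):
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict_probability(intercept, coefficients, features):
    """Probability of the outcome coded 1, given feature values."""
    eta = intercept + sum(coefficients[name] * value
                          for name, value in features.items())
    return logistic(eta)


# Hypothetical coefficients and patient, chosen only for illustration.
coefficients = {"is_male": -0.5, "first_age": 0.01}
patient = {"is_male": 1, "first_age": 40}

p = predict_probability(0.2, coefficients, patient)
label = 1 if p >= 0.5 else 0  # 0.5 is an assumed cutoff, not a tuned one
```

The cutoff is a separate decision from the model fit; changing it trades sensitivity against specificity without altering the estimated probabilities.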
Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
ifelse(type == : prediction from a rank-deficient fit may be misleading
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5706 -0.9406 0.5530 0.8488 2.6402
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.811e-01 1.935e-01 1.452 0.146381
first_age 5.796e-03 2.996e-03 1.935 0.053018 .
bmi 1.666e-04 2.077e-04 0.802 0.422385
prop_days_on_arvs -5.293e-01 1.037e-01 -5.104 3.32e-07 ***
prop_days_on_tb_meds 1.370e-01 2.023e-01 0.677 0.498447
prop_days_on_tb_prop 7.863e-01 1.883e-01 4.176 2.97e-05 ***
prop_defaulted_apptmts -6.869e-01 2.899e-01 -2.370 0.017809 *
prop_bad_adherence -2.251e-01 5.701e-01 -0.395 0.692949
num_encounters 1.466e-02 9.382e-03 1.563 0.118139
vl_count_1_log 2.069e-03 4.853e-03 0.426 0.669849
first_who_stage 4.310e-02 3.415e-02 1.262 0.206876
first_arv_line -2.390e-01 7.882e-02 -3.032 0.002432 **
is_male -6.773e-01 6.559e-02 -10.326 < 2e-16 ***
is_status_disclosed 2.096e-01 1.649e-01 1.271 0.203849
is_on_contraceptive 5.759e-01 8.660e-02 6.651 2.91e-11 ***
is_on_health_cover -4.904e-01 6.553e-02 -7.483 7.25e-14 ***
is_on_cryptococcus_tx -4.373e-01 3.223e-01 -1.357 0.174839
is_on_tb_prophy_regimen 8.352e-02 6.858e-02 1.218 0.223281
has_sti_symptoms -4.931e-01 2.174e-01 -2.269 0.023293 *
has_tb_symptoms -2.070e-01 1.047e-01 -1.978 0.047970 *
has_drug_tox_efcts 5.644e-01 3.410e-01 1.655 0.097901 .
has_toxic_drug 4.264e-03 5.486e-01 0.008 0.993798
has_referral_order 1.511e-01 7.034e-02 2.148 0.031693 *
has_phdp_referral 3.741e-01 1.071e-01 3.494 0.000477 ***
needs_fam_tx_support 5.621e-01 8.993e-02 6.250 4.10e-10 ***
has_changed_pcp -1.567e+00 3.899e-01 -4.018 5.87e-05 ***
has_changed_tb_tx 1.061e-01 3.130e-01 0.339 0.734711
has_restarted_tb_tx 1.058e+00 4.042e-01 2.617 0.008879 **
has_been_hospitalized -5.383e-01 1.786e-01 -3.015 0.002571 **
has_sulf_peni_rxns -2.163e-01 5.538e-01 -0.391 0.696065
is_general_pexam -2.848e-01 1.034e-01 -2.754 0.005894 **
is_skin_pexam 4.121e-01 1.325e-01 3.111 0.001865 **
is_lymph_nodes_pexam 5.915e-01 3.188e-01 1.855 0.063530 .
is_respiratory_pexam 6.415e-01 2.289e-01 2.803 0.005069 **
is_heent_pexam -8.804e-01 2.895e-01 -3.042 0.002353 **
is_cardiac_pexam 1.191e+01 1.701e+02 0.070 0.944160
is_abdominal_pexam 1.132e+00 5.236e-01 2.162 0.030603 *
is_urogenital_pexam 1.239e-01 7.945e-02 1.559 0.118941
is_extremies_pexam -2.516e-01 4.389e-01 -0.573 0.566470
is_psychiatric_pexam 7.463e-01 8.262e-01 0.903 0.366382
is_neurologic_pexam -4.674e-01 7.192e-01 -0.650 0.515757
is_musculoskeletal_pexam 4.173e-01 8.919e-01 0.468 0.639841
is_cxr_code_labs 2.151e-01 1.920e-01 1.120 0.262604
is_underweight 2.383e-01 9.167e-02 2.600 0.009330 **
has_high_bp -4.214e-02 6.544e-02 -0.644 0.519615
has_low_bp -9.202e-01 2.989e-01 -3.079 0.002077 **
has_abnormal_oxy_sat 9.641e-01 2.921e-01 3.301 0.000964 ***
has_fever -3.819e-01 2.366e-01 -1.614 0.106515
virologic_failure1 -2.019e+00 1.071e-01 -18.850 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8623.8 on 6313 degrees of freedom
Residual deviance: 7045.9 on 6265 degrees of freedom
AIC: 7143.9
Number of Fisher Scoring iterations: 11
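The estimates in the summary above are on the log-odds scale, so exponentiating a coefficient gives the corresponding odds ratio. The following is a minimal sketch in plain Python (the report's own output comes from R), using three estimates copied from the table; the helper name `odds_ratio` is hypothetical. Which outcome level the model treats as the event is not shown in this output, so the direction of each effect should be read with that caveat.

```python
import math

# Estimates copied from the logistic regression summary above (log-odds scale).
coefficients = {
    "is_male": -0.6773,
    "is_on_contraceptive": 0.5759,
    "has_changed_pcp": -1.567,
}


def odds_ratio(log_odds):
    """Convert a log-odds coefficient to an odds ratio (hypothetical helper)."""
    return math.exp(log_odds)


for name, beta in coefficients.items():
    print(f"{name}: OR = {odds_ratio(beta):.3f}")
```

An odds ratio below 1 means the odds of the modelled outcome are lower when the covariate is present, and above 1 means they are higher, holding the other covariates fixed.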
Model Discrimination Analysis
Call:
roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~ dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best", print.thres = "best", print.auc.y = 4, main = modelName, percent = TRUE, ci = F, of = "thresholds")
Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
Area under the curve: 72.04%
Confusion Matrix
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 568 981
0 334 2732
Accuracy : 0.7151
95% CI : (0.7018, 0.7281)
No Information Rate : 0.8046
P-Value [Acc > NIR] : 1
Kappa : 0.2875
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.6297
Specificity : 0.7358
Pos Pred Value : 0.3667
Neg Pred Value : 0.8911
Prevalence : 0.1954
Detection Rate : 0.1231
Detection Prevalence : 0.3356
Balanced Accuracy : 0.6828
'Positive' Class : 1
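The statistics reported with each confusion matrix follow directly from its four cell counts. As a rough sketch (plain Python rather than the R function that produced the output; `confusion_metrics` is a hypothetical helper name), the figures for the matrix above can be reproduced like this, taking class 1 as the positive class:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive summary statistics from a 2x2 confusion matrix (positive class = 1)."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Counts from the matrix above: rows are predictions, columns are the reference.
metrics = confusion_metrics(tp=568, fp=981, fn=334, tn=2732)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The same helper applies to every confusion matrix in this report, since they all share the layout with predictions in rows and the reference in columns.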
Out of Sample Validation
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 114 144
0 86 656
Accuracy : 0.77
95% CI : (0.7426, 0.7958)
No Information Rate : 0.8
P-Value [Acc > NIR] : 0.991248
Kappa : 0.3517
Mcnemar's Test P-Value : 0.000171
Sensitivity : 0.5700
Specificity : 0.8200
Pos Pred Value : 0.4419
Neg Pred Value : 0.8841
Prevalence : 0.2000
Detection Rate : 0.1140
Detection Prevalence : 0.2580
Balanced Accuracy : 0.6950
'Positive' Class : 1
Penalised Logistic Regression Model
Model summary
49 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.3659385347
first_age .
bmi .
prop_days_on_arvs -0.0682323886
prop_days_on_tb_meds 0.0433872586
prop_days_on_tb_prop 0.5750124444
prop_defaulted_apptmts .
prop_bad_adherence .
num_encounters 0.0043909959
vl_count_1_log -0.0081681485
first_who_stage 0.0529748212
first_arv_line .
is_male -0.4227475399
is_status_disclosed .
is_on_contraceptive 0.3585342405
is_on_health_cover -0.2331863196
is_on_cryptococcus_tx .
is_on_tb_prophy_regimen 0.0773758945
has_sti_symptoms .
has_tb_symptoms -0.0013162333
has_drug_tox_efcts .
has_toxic_drug .
has_referral_order 0.1288846112
has_phdp_referral 0.1098604608
needs_fam_tx_support 0.2633168165
has_changed_pcp -0.0139558525
has_changed_tb_tx 0.1286133577
has_restarted_tb_tx 0.0846502319
has_been_hospitalized -0.6484629552
has_sulf_peni_rxns -0.1377638407
is_general_pexam .
is_skin_pexam 0.0277588432
is_lymph_nodes_pexam .
is_respiratory_pexam 0.0103840844
is_heent_pexam -0.2705097683
is_cardiac_pexam .
is_abdominal_pexam 0.3233126126
is_urogenital_pexam 0.0008297731
is_extremies_pexam .
is_psychiatric_pexam .
is_neurologic_pexam .
is_musculoskeletal_pexam 0.0059279673
is_cxr_code_labs .
is_underweight 0.0100111218
has_high_bp .
has_low_bp -0.0161505544
has_abnormal_oxy_sat .
has_fever .
virologic_failure1 -1.4657655240
Model Discrimination Analysis
Call:
roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~ dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best", print.thres = "best", print.auc.y = 4, main = modelName, percent = TRUE, ci = F, of = "thresholds")
Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
Area under the curve: 71.55%
Confusion Matrix
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 539 829
0 363 2884
Accuracy : 0.7417
95% CI : (0.7288, 0.7543)
No Information Rate : 0.8046
P-Value [Acc > NIR] : 1
Kappa : 0.3131
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.5976
Specificity : 0.7767
Pos Pred Value : 0.3940
Neg Pred Value : 0.8882
Prevalence : 0.1954
Detection Rate : 0.1168
Detection Prevalence : 0.2964
Balanced Accuracy : 0.6871
'Positive' Class : 1
Out of Sample Validation
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 100 105
0 100 695
Accuracy : 0.795
95% CI : (0.7686, 0.8196)
No Information Rate : 0.8
P-Value [Acc > NIR] : 0.6705
Kappa : 0.3653
Mcnemar's Test P-Value : 0.7800
Sensitivity : 0.5000
Specificity : 0.8688
Pos Pred Value : 0.4878
Neg Pred Value : 0.8742
Prevalence : 0.2000
Detection Rate : 0.1000
Detection Prevalence : 0.2050
Balanced Accuracy : 0.6844
'Positive' Class : 1
KNN
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. k-NN is a type of instance-based learning, or lazy learning: the function is only approximated locally, and all computation is deferred until classification.
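To make the "lazy learning" idea concrete, here is a minimal sketch in plain Python, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy data are hypothetical and are not taken from the report, whose model was fitted in R.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # No model is fitted in advance: all distances are computed at prediction time.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data with two features per patient (not from the report).
train_points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_labels = ["No", "No", "Yes", "Yes", "Yes"]

print(knn_predict(train_points, train_labels, query=(0.15, 0.15), k=3))
print(knn_predict(train_points, train_labels, query=(0.8, 0.8), k=3))
```

Because distances drive the vote, predictors on very different scales would usually be standardised first; the sketch skips that step for brevity.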
user system elapsed
2.125 0.044 28.202
k-Nearest Neighbors
4615 samples
48 predictor
2 classes: 'Yes', 'No'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4154, 4152, 4153, 4154, 4153, 4154, ...
Addtional sampling using SMOTE
Resampling results across tuning parameters:
k ROC Sens Spec
44 0.7041066 0.4567277 0.8419174
45 0.6975941 0.4434066 0.8497247
Sens was used to select the optimal model using the largest value.
The final value used for the model was k = 44.
Model Discrimination Analysis
Call:
roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~ dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best", print.thres = "best", print.auc.y = 4, main = modelName, percent = TRUE, ci = F, of = "thresholds")
Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
Area under the curve: 73.94%
Confusion Matrix
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 563 987
0 339 2726
Accuracy : 0.7127
95% CI : (0.6994, 0.7257)
No Information Rate : 0.8046
P-Value [Acc > NIR] : 1
Kappa : 0.2817
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.6242
Specificity : 0.7342
Pos Pred Value : 0.3632
Neg Pred Value : 0.8894
Prevalence : 0.1954
Detection Rate : 0.1220
Detection Prevalence : 0.3359
Balanced Accuracy : 0.6792
'Positive' Class : 1
Out of Sample Validation
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 119 203
0 81 597
Accuracy : 0.716
95% CI : (0.6869, 0.7438)
No Information Rate : 0.8
P-Value [Acc > NIR] : 1
Kappa : 0.2777
Mcnemar's Test P-Value : 6.97e-13
Sensitivity : 0.5950
Specificity : 0.7462
Pos Pred Value : 0.3696
Neg Pred Value : 0.8805
Prevalence : 0.2000
Detection Rate : 0.1190
Detection Prevalence : 0.3220
Balanced Accuracy : 0.6706
'Positive' Class : 1
See also: https://rpubs.com/chengjiun/52658
Classification and Regression Trees (CART)
Classification and Regression Trees (CART) were first introduced in 1984 by a group led by Leo Breiman (Breiman et al. 1984). The CART algorithm sequentially conducts binary splits on the variables provided to it, resulting in a decision structure that resembles its namesake, a tree.
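A single binary split can be sketched as follows. This is a simplified illustration in plain Python, assuming Gini impurity as the split criterion and one numeric predictor; the function names `gini` and `best_split` and the toy data are hypothetical. A full tree would apply the same search recursively to each resulting subset.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy data: one numeric predictor and a binary outcome.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(values, labels))
```

In this toy example the split at 3.0 separates the two classes perfectly, so the weighted impurity of the child nodes drops to zero.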
CART
4615 samples
48 predictor
2 classes: 'Yes', 'No'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4154, 4154, 4153, 4153, 4153, 4154, ...
Addtional sampling using SMOTE
Resampling results across tuning parameters:
cp ROC Sens Spec
0e+00 0.6489236 0.3912698 0.8661428
1e-04 0.6502539 0.3890476 0.8720655
2e-04 0.6595358 0.3801709 0.8841811
3e-04 0.6670730 0.3723932 0.8973814
4e-04 0.6738526 0.3657387 0.9057321
5e-04 0.6805319 0.3557631 0.9178535
6e-04 0.6824817 0.3546520 0.9210873
7e-04 0.6870988 0.3591087 0.9254043
8e-04 0.6870988 0.3591087 0.9254043
9e-04 0.6884815 0.3535653 0.9289040
1e-03 0.6874124 0.3535409 0.9289040
Sens was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.
Tree Diagram
Model Discrimination Analysis
Call:
roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~ dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best", print.thres = "best", print.auc.y = 4, main = modelName, percent = TRUE, ci = F, of = "thresholds")
Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
Area under the curve: 83.36%
Confusion Matrix
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 566 399
0 336 3314
Accuracy : 0.8407
95% CI : (0.8299, 0.8512)
No Information Rate : 0.8046
P-Value [Acc > NIR] : 1.189e-10
Kappa : 0.5066
Mcnemar's Test P-Value : 0.0222
Sensitivity : 0.6275
Specificity : 0.8925
Pos Pred Value : 0.5865
Neg Pred Value : 0.9079
Prevalence : 0.1954
Detection Rate : 0.1226
Detection Prevalence : 0.2091
Balanced Accuracy : 0.7600
'Positive' Class : 1
Out of Sample Validation
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 118 72
0 82 728
Accuracy : 0.846
95% CI : (0.8221, 0.8678)
No Information Rate : 0.8
P-Value [Acc > NIR] : 0.0001063
Kappa : 0.5096
Mcnemar's Test P-Value : 0.4683044
Sensitivity : 0.5900
Specificity : 0.9100
Pos Pred Value : 0.6211
Neg Pred Value : 0.8988
Prevalence : 0.2000
Detection Rate : 0.1180
Detection Prevalence : 0.1900
Balanced Accuracy : 0.7500
'Positive' Class : 1
XGBOOST
https://xgboost.readthedocs.io/en/latest/parameter.html
Relative Importance
Feature Gain Cover Frequency Importance
1: virologic_failure1 0.293615243 0.012821169 0.005049682 0.293615243
2: vl_count_1_log 0.083067386 0.132833167 0.129337026 0.083067386
3: bmi 0.081313208 0.244015642 0.229027529 0.081313208
4: has_referral_order 0.073463121 0.008024496 0.013194331 0.073463121
5: is_male 0.068524036 0.011444079 0.016940870 0.068524036
6: is_on_health_cover 0.062913742 0.009250477 0.016126405 0.062913742
7: first_age 0.045449906 0.093886460 0.115328229 0.045449906
8: prop_days_on_arvs 0.038685462 0.113080302 0.110604333 0.038685462
9: first_arv_line 0.033575324 0.009918803 0.012542759 0.033575324
10: num_encounters 0.030569616 0.030345682 0.062225118 0.030569616
11: has_high_bp 0.022368534 0.008865405 0.012216973 0.022368534
12: prop_days_on_tb_prop 0.018424878 0.066293104 0.053754683 0.018424878
13: is_urogenital_pexam 0.016909827 0.006929173 0.011728295 0.016909827
14: first_who_stage 0.015576339 0.020660829 0.020361622 0.015576339
15: is_on_contraceptive 0.015135534 0.009007978 0.009773579 0.015135534
16: prop_defaulted_apptmts 0.015061570 0.030476917 0.034207526 0.015061570
17: is_on_tb_prophy_regimen 0.014798505 0.006629841 0.017266656 0.014798505
18: has_phdp_referral 0.014453970 0.005445195 0.006352826 0.014453970
19: has_tb_symptoms 0.012362540 0.006140207 0.007493077 0.012362540
20: needs_fam_tx_support 0.009067414 0.006716438 0.010913830 0.009067414
Model Discrimination Analysis
Call:
roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~ dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best", print.thres = "best", print.auc.y = 4, main = modelName, percent = TRUE, ci = F, of = "thresholds")
Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
Area under the curve: 97.35%
Confusion Matrix
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 818 215
0 84 3498
Accuracy : 0.9352
95% CI : (0.9277, 0.9421)
No Information Rate : 0.8046
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8047
Mcnemar's Test P-Value : 5.558e-14
Sensitivity : 0.9069
Specificity : 0.9421
Pos Pred Value : 0.7919
Neg Pred Value : 0.9765
Prevalence : 0.1954
Detection Rate : 0.1772
Detection Prevalence : 0.2238
Balanced Accuracy : 0.9245
'Positive' Class : 1
Out of Sample Validation
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 175 37
0 25 763
Accuracy : 0.938
95% CI : (0.9212, 0.9521)
No Information Rate : 0.8
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8105
Mcnemar's Test P-Value : 0.1624
Sensitivity : 0.8750
Specificity : 0.9537
Pos Pred Value : 0.8255
Neg Pred Value : 0.9683
Prevalence : 0.2000
Detection Rate : 0.1750
Detection Prevalence : 0.2120
Balanced Accuracy : 0.9144
'Positive' Class : 1
GBM
https://xgboost.readthedocs.io/en/latest/parameter.html
user system elapsed
5.738 0.072 121.354
Stochastic Gradient Boosting
4615 samples
48 predictor
2 classes: 'Yes', 'No'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4154, 4154, 4153, 4154, 4153, 4153, ...
Addtional sampling using SMOTE
Resampling results across tuning parameters:
interaction.depth n.trees ROC Sens Spec
9 100 0.7197977 0.3736386 0.9383235
9 300 0.7054051 0.3802930 0.9210851
9 400 0.7045447 0.3825397 0.9181252
10 100 0.7218032 0.3768864 0.9380467
10 300 0.7144480 0.3990965 0.9248536
10 400 0.7080026 0.3968864 0.9127416
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 20
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 100,
interaction.depth = 10, shrinkage = 0.1 and n.minobsinnode = 20.
Relative Importance
Model Discrimination Analysis
Call:
roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~ dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best", print.thres = "best", print.auc.y = 4, main = modelName, percent = TRUE, ci = F, of = "thresholds")
Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
Area under the curve: 85.37%
Confusion Matrix
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 472 241
0 430 3472
Accuracy : 0.8546
95% CI : (0.8441, 0.8647)
No Information Rate : 0.8046
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4979
Mcnemar's Test P-Value : 3.938e-13
Sensitivity : 0.5233
Specificity : 0.9351
Pos Pred Value : 0.6620
Neg Pred Value : 0.8898
Prevalence : 0.1954
Detection Rate : 0.1023
Detection Prevalence : 0.1545
Balanced Accuracy : 0.7292
'Positive' Class : 1
Out of Sample Validation
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 89 29
0 111 771
Accuracy : 0.86
95% CI : (0.8369, 0.8809)
No Information Rate : 0.8
P-Value [Acc > NIR] : 4.791e-07
Kappa : 0.483
Mcnemar's Test P-Value : 7.608e-12
Sensitivity : 0.4450
Specificity : 0.9637
Pos Pred Value : 0.7542
Neg Pred Value : 0.8741
Prevalence : 0.2000
Detection Rate : 0.0890
Detection Prevalence : 0.1180
Balanced Accuracy : 0.7044
'Positive' Class : 1
Random Forest
Random forests improve predictive accuracy by growing a large number of trees on bootstrapped samples of the data (considering a random subset of variables at each split), classifying a case using each tree in this new “forest”, and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler’s random forest approach is implemented via the randomForest package.
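The two ingredients described above, bootstrap resampling and vote aggregation, can be sketched in a few lines of plain Python. This is an illustration only, not the randomForest implementation; the helper names `bootstrap_sample` and `majority_vote`, the row indices, and the votes are all hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as is done for each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into one label."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)   # fixed seed so the sketch is repeatable
rows = list(range(10))    # hypothetical row indices of a training set
sample = bootstrap_sample(rows, rng)
out_of_bag = sorted(set(rows) - set(sample))  # rows this tree never saw

# Hypothetical votes from five trees for a single patient.
votes = ["Yes", "No", "Yes", "Yes", "No"]
print(len(sample), out_of_bag, majority_vote(votes))
```

Rows left out of a tree's bootstrap sample are that tree's out-of-bag rows, which is what allows a forest to estimate its error without a separate validation set.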
user system elapsed
15.809 0.168 141.097
Model Discrimination Analysis
Call:
roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~ dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best", print.thres = "best", print.auc.y = 4, main = modelName, percent = TRUE, ci = F, of = "thresholds")
Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
Area under the curve: 98.47%
Confusion Matrix
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 795 143
0 107 3570
Accuracy : 0.9458
95% CI : (0.9389, 0.9522)
No Information Rate : 0.8046
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8303
Mcnemar's Test P-Value : 0.02686
Sensitivity : 0.8814
Specificity : 0.9615
Pos Pred Value : 0.8475
Neg Pred Value : 0.9709
Prevalence : 0.1954
Detection Rate : 0.1723
Detection Prevalence : 0.2033
Balanced Accuracy : 0.9214
'Positive' Class : 1
Relative Importance of Variables
According to https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html, “the mean decrease in Gini coefficient is a measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.”
| Variable | MeanDecreaseGini |
|---|---|
| first_age | 133.19 |
| bmi | 141.84 |
| prop_days_on_arvs | 110.83 |
| prop_days_on_tb_meds | 38.00 |
| prop_days_on_tb_prop | 62.79 |
| prop_defaulted_apptmts | 76.49 |
| prop_bad_adherence | 15.16 |
| num_encounters | 112.13 |
| vl_count_1_log | 322.87 |
| first_who_stage | 71.64 |
| first_arv_line | 107.67 |
| is_male | 170.84 |
| is_status_disclosed | 12.20 |
| is_on_contraceptive | 108.82 |
| is_on_health_cover | 109.04 |
| is_on_cryptococcus_tx | 4.30 |
| is_on_tb_prophy_regimen | 129.16 |
| has_sti_symptoms | 9.48 |
| has_tb_symptoms | 39.53 |
| has_drug_tox_efcts | 3.30 |
| has_toxic_drug | 1.35 |
| has_referral_order | 110.91 |
| has_phdp_referral | 52.22 |
| needs_fam_tx_support | 47.79 |
| has_changed_pcp | 5.99 |
| has_changed_tb_tx | 3.52 |
| has_restarted_tb_tx | 2.48 |
| has_been_hospitalized | 36.63 |
| has_sulf_peni_rxns | 1.10 |
| is_general_pexam | 32.96 |
| is_skin_pexam | 14.97 |
| is_lymph_nodes_pexam | 3.28 |
| is_respiratory_pexam | 6.27 |
| is_heent_pexam | 5.10 |
| is_cardiac_pexam | 0.06 |
| is_abdominal_pexam | 1.06 |
| is_urogenital_pexam | 55.49 |
| is_extremies_pexam | 2.70 |
| is_psychiatric_pexam | 0.13 |
| is_neurologic_pexam | 0.61 |
| is_musculoskeletal_pexam | 0.23 |
| is_cxr_code_labs | 8.53 |
| is_underweight | 32.72 |
| has_high_bp | 110.37 |
| has_low_bp | 5.18 |
| has_abnormal_oxy_sat | 4.64 |
| has_fever | 7.68 |
| virologic_failure1 | 374.03 |
Error Rate
This plot shows the class error rates of the random forest model. As the number of trees increases, the error rate approaches zero.
Out of Sample Validation
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 168 21
0 32 779
Accuracy : 0.947
95% CI : (0.9312, 0.9601)
No Information Rate : 0.8
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8309
Mcnemar's Test P-Value : 0.1696
Sensitivity : 0.8400
Specificity : 0.9738
Pos Pred Value : 0.8889
Neg Pred Value : 0.9605
Prevalence : 0.2000
Detection Rate : 0.1680
Detection Prevalence : 0.1890
Balanced Accuracy : 0.9069
'Positive' Class : 1
BART (Bayesian Additive Regression Trees)
SVM
user system elapsed
17.129 3.494 215.886
Support Vector Machines with Radial Basis Function Kernel
4615 samples
48 predictor
2 classes: 'Yes', 'No'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4153, 4154, 4154, 4154, 4154, 4153, ...
Addtional sampling using SMOTE
Resampling results across tuning parameters:
C ROC Sens Spec
0.25 0.6810117 0.4190965 0.8685701
0.50 0.6943176 0.4268254 0.8742305
1.00 0.6998193 0.4102076 0.8736921
Tuning parameter 'sigma' was held constant at a value of 0.02132118
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.02132118 and C = 1.
Model Discrimination Analysis
Call:
roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~ dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best", print.thres = "best", print.auc.y = 4, main = modelName, percent = TRUE, ci = F, of = "thresholds")
Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
Area under the curve: 83.19%
Confusion Matrix
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 636 609
0 266 3104
Accuracy : 0.8104
95% CI : (0.7988, 0.8216)
No Information Rate : 0.8046
P-Value [Acc > NIR] : 0.1627
Kappa : 0.473
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.7051
Specificity : 0.8360
Pos Pred Value : 0.5108
Neg Pred Value : 0.9211
Prevalence : 0.1954
Detection Rate : 0.1378
Detection Prevalence : 0.2698
Balanced Accuracy : 0.7705
'Positive' Class : 1
Out of Sample Validation
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 124 102
0 76 698
Accuracy : 0.822
95% CI : (0.7969, 0.8452)
No Information Rate : 0.8
P-Value [Acc > NIR] : 0.04311
Kappa : 0.4696
Mcnemar's Test P-Value : 0.06095
Sensitivity : 0.6200
Specificity : 0.8725
Pos Pred Value : 0.5487
Neg Pred Value : 0.9018
Prevalence : 0.2000
Detection Rate : 0.1240
Detection Prevalence : 0.2260
Balanced Accuracy : 0.7463
'Positive' Class : 1
Comparative Analysis
Sensitivity And Specificity Analysis
Call:
summary.resamples(object = resamps)
Models: XGBoost, GBM, BART, RandomForest, SVMRadial, GLMNET, OLSLogistic, CART, KNN
Number of resamples: 10
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
XGBoost 0.8596014 0.9088609 0.9322113 0.9289775 0.9481066 1.0000000 0
GBM 0.9188805 0.9237962 0.9349068 0.9397287 0.9468209 0.9808322 0
BART 0.8943432 0.9140283 0.9230949 0.9322802 0.9506779 0.9914611 0
RandomForest 0.8747605 0.8934152 0.9021964 0.9048755 0.9226003 0.9293582 0
SVMRadial 0.7824000 0.8392000 0.8824000 0.8762667 0.9132000 0.9434667 0
GLMNET 0.7482482 0.8369514 0.8561110 0.8510192 0.8746921 0.9079079 0
OLSLogistic 0.7592202 0.8080992 0.8386175 0.8379713 0.8610967 0.9178082 0
CART 0.7633300 0.8455481 0.8918181 0.8723098 0.9055649 0.9350198 0
KNN 0.7171487 0.7736976 0.8128080 0.8249873 0.8974142 0.9308732 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
XGBoost 0.7741935 0.7875504 0.8225806 0.8399194 0.8641633 1.0000000 0
GBM 0.7812500 0.8387097 0.8573589 0.8560484 0.8709677 0.9354839 0
BART 0.7741935 0.8190524 0.8387097 0.8365927 0.8641633 0.8709677 0
RandomForest 0.6206897 0.7500000 0.7857143 0.7604680 0.8143473 0.8275862 0
SVMRadial 0.6000000 0.6900000 0.7400000 0.7360000 0.7600000 0.8800000 0
GLMNET 0.5555556 0.6538462 0.6794872 0.6851852 0.7307692 0.7692308 0
OLSLogistic 0.5769231 0.6356838 0.6730769 0.6962963 0.7382479 0.9230769 0
CART 0.6071429 0.6958128 0.7721675 0.7498768 0.7912562 0.9285714 0
KNN 0.4814815 0.5808405 0.7307692 0.7052707 0.8269231 0.8846154 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
XGBoost 0.8985507 0.9420290 0.9489557 0.9564152 0.9817775 1.0000000 0
GBM 0.9130435 0.9600384 0.9708014 0.9622336 0.9710145 0.9855072 0
BART 0.8405797 0.9130435 0.9492754 0.9376385 0.9670716 1.0000000 0
RandomForest 0.9436620 0.9444444 0.9444444 0.9567488 0.9683099 0.9861111 0
SVMRadial 0.8666667 0.9333333 0.9333333 0.9360000 0.9566667 0.9733333 0
GLMNET 0.8783784 0.9220659 0.9388190 0.9415957 0.9695946 0.9864865 0
OLSLogistic 0.8378378 0.8949463 0.9183636 0.9117734 0.9324324 0.9459459 0
CART 0.8732394 0.8990610 0.9305556 0.9272692 0.9546655 0.9722222 0
KNN 0.7671233 0.8141892 0.8503332 0.8410959 0.8614865 0.9324324 0
ROCs
Discriminatory
Calibration
https://stats.stackexchange.com/questions/261835/interpretation-of-calibration-curve https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
Model Interpretation
Cutoff Optimization
| Model | Cutoff | Sens | Spec | Accuracy | PPV | NPV | F1 | Bal Acc | Kappa |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.10 | 0.88 | 0.90 | 0.89 | 0.81 | 0.94 | 0.84 | 0.89 | 0.76 |
| XGBoost | 0.15 | 0.87 | 0.92 | 0.90 | 0.83 | 0.94 | 0.85 | 0.89 | 0.78 |
| XGBoost | 0.20 | 0.87 | 0.92 | 0.91 | 0.84 | 0.94 | 0.85 | 0.90 | 0.79 |
| XGBoost | 0.25 | 0.86 | 0.94 | 0.91 | 0.86 | 0.94 | 0.86 | 0.90 | 0.80 |
| XGBoost | 0.30 | 0.85 | 0.94 | 0.91 | 0.87 | 0.93 | 0.86 | 0.90 | 0.80 |
| XGBoost | 0.35 | 0.85 | 0.95 | 0.92 | 0.89 | 0.93 | 0.87 | 0.90 | 0.81 |
| XGBoost | 0.40 | 0.85 | 0.95 | 0.92 | 0.89 | 0.93 | 0.87 | 0.90 | 0.81 |
| XGBoost | 0.45 | 0.85 | 0.95 | 0.92 | 0.90 | 0.93 | 0.87 | 0.90 | 0.81 |
| XGBoost | 0.50 | 0.84 | 0.96 | 0.92 | 0.90 | 0.93 | 0.87 | 0.90 | 0.81 |
| BART | 0.10 | 0.90 | 0.87 | 0.88 | 0.77 | 0.95 | 0.83 | 0.89 | 0.74 |
| BART | 0.15 | 0.89 | 0.91 | 0.90 | 0.82 | 0.95 | 0.85 | 0.90 | 0.78 |
| BART | 0.20 | 0.89 | 0.93 | 0.92 | 0.86 | 0.95 | 0.87 | 0.91 | 0.81 |
| BART | 0.25 | 0.88 | 0.94 | 0.92 | 0.88 | 0.95 | 0.88 | 0.91 | 0.82 |
| BART | 0.30 | 0.88 | 0.94 | 0.92 | 0.88 | 0.95 | 0.88 | 0.91 | 0.82 |
| BART | 0.35 | 0.87 | 0.95 | 0.92 | 0.89 | 0.94 | 0.88 | 0.91 | 0.82 |
| BART | 0.40 | 0.87 | 0.95 | 0.93 | 0.90 | 0.94 | 0.88 | 0.91 | 0.83 |
| BART | 0.45 | 0.86 | 0.96 | 0.93 | 0.91 | 0.94 | 0.88 | 0.91 | 0.83 |
| BART | 0.50 | 0.86 | 0.96 | 0.93 | 0.91 | 0.94 | 0.88 | 0.91 | 0.83 |
| GBM | 0.10 | 0.92 | 0.75 | 0.81 | 0.63 | 0.96 | 0.75 | 0.84 | 0.60 |
| GBM | 0.15 | 0.91 | 0.80 | 0.83 | 0.67 | 0.95 | 0.77 | 0.85 | 0.65 |
| GBM | 0.20 | 0.91 | 0.83 | 0.85 | 0.71 | 0.95 | 0.80 | 0.87 | 0.69 |
| GBM | 0.25 | 0.89 | 0.86 | 0.87 | 0.76 | 0.95 | 0.82 | 0.88 | 0.72 |
| GBM | 0.30 | 0.88 | 0.89 | 0.88 | 0.78 | 0.94 | 0.83 | 0.88 | 0.74 |
| GBM | 0.35 | 0.87 | 0.91 | 0.90 | 0.82 | 0.94 | 0.84 | 0.89 | 0.76 |
| GBM | 0.40 | 0.86 | 0.93 | 0.91 | 0.85 | 0.93 | 0.85 | 0.89 | 0.78 |
| GBM | 0.45 | 0.85 | 0.93 | 0.91 | 0.86 | 0.93 | 0.85 | 0.89 | 0.78 |
| GBM | 0.50 | 0.84 | 0.94 | 0.91 | 0.87 | 0.93 | 0.85 | 0.89 | 0.78 |
| RandomForest | 0.10 | 0.92 | 0.53 | 0.64 | 0.44 | 0.94 | 0.59 | 0.73 | 0.34 |
| RandomForest | 0.15 | 0.89 | 0.75 | 0.79 | 0.59 | 0.95 | 0.71 | 0.82 | 0.55 |
| RandomForest | 0.20 | 0.84 | 0.85 | 0.85 | 0.70 | 0.93 | 0.76 | 0.84 | 0.65 |
| RandomForest | 0.25 | 0.82 | 0.90 | 0.88 | 0.77 | 0.93 | 0.79 | 0.86 | 0.71 |
| RandomForest | 0.30 | 0.79 | 0.93 | 0.89 | 0.82 | 0.92 | 0.80 | 0.86 | 0.73 |
| RandomForest | 0.35 | 0.79 | 0.94 | 0.90 | 0.85 | 0.92 | 0.82 | 0.87 | 0.75 |
| RandomForest | 0.40 | 0.77 | 0.95 | 0.90 | 0.87 | 0.91 | 0.82 | 0.86 | 0.75 |
| RandomForest | 0.45 | 0.77 | 0.96 | 0.90 | 0.88 | 0.91 | 0.82 | 0.86 | 0.75 |
| RandomForest | 0.50 | 0.76 | 0.96 | 0.90 | 0.88 | 0.91 | 0.81 | 0.86 | 0.75 |
| SVMRadial | 0.10 | 0.83 | 0.73 | 0.76 | 0.51 | 0.93 | 0.63 | 0.78 | 0.47 |
| SVMRadial | 0.15 | 0.82 | 0.81 | 0.81 | 0.60 | 0.93 | 0.69 | 0.82 | 0.56 |
| SVMRadial | 0.20 | 0.81 | 0.84 | 0.83 | 0.63 | 0.93 | 0.71 | 0.83 | 0.59 |
| SVMRadial | 0.25 | 0.81 | 0.86 | 0.85 | 0.66 | 0.93 | 0.73 | 0.83 | 0.62 |
| SVMRadial | 0.30 | 0.80 | 0.88 | 0.86 | 0.70 | 0.93 | 0.74 | 0.84 | 0.65 |
| SVMRadial | 0.35 | 0.79 | 0.90 | 0.87 | 0.72 | 0.93 | 0.75 | 0.84 | 0.66 |
| SVMRadial | 0.40 | 0.78 | 0.91 | 0.88 | 0.75 | 0.92 | 0.76 | 0.85 | 0.68 |
| SVMRadial | 0.45 | 0.76 | 0.93 | 0.89 | 0.79 | 0.92 | 0.77 | 0.84 | 0.70 |
| SVMRadial | 0.50 | 0.74 | 0.94 | 0.89 | 0.80 | 0.91 | 0.76 | 0.84 | 0.69 |
| GLMNET | 0.10 | 0.95 | 0.22 | 0.41 | 0.30 | 0.93 | 0.46 | 0.59 | 0.10 |
| GLMNET | 0.15 | 0.87 | 0.48 | 0.59 | 0.38 | 0.91 | 0.53 | 0.68 | 0.25 |
| GLMNET | 0.20 | 0.81 | 0.66 | 0.70 | 0.46 | 0.91 | 0.59 | 0.73 | 0.38 |
| GLMNET | 0.25 | 0.78 | 0.74 | 0.75 | 0.52 | 0.90 | 0.63 | 0.76 | 0.45 |
| GLMNET | 0.30 | 0.77 | 0.82 | 0.81 | 0.61 | 0.91 | 0.68 | 0.80 | 0.55 |
| GLMNET | 0.35 | 0.75 | 0.88 | 0.85 | 0.70 | 0.91 | 0.72 | 0.82 | 0.62 |
| GLMNET | 0.40 | 0.72 | 0.91 | 0.86 | 0.75 | 0.90 | 0.74 | 0.82 | 0.64 |
| GLMNET | 0.45 | 0.71 | 0.93 | 0.87 | 0.80 | 0.90 | 0.75 | 0.82 | 0.66 |
| GLMNET | 0.50 | 0.69 | 0.94 | 0.87 | 0.82 | 0.89 | 0.74 | 0.81 | 0.66 |
| OLSLogistic | 0.10 | 0.84 | 0.55 | 0.63 | 0.40 | 0.91 | 0.54 | 0.70 | 0.29 |
| OLSLogistic | 0.15 | 0.80 | 0.65 | 0.69 | 0.45 | 0.90 | 0.58 | 0.72 | 0.36 |
| OLSLogistic | 0.20 | 0.78 | 0.73 | 0.74 | 0.51 | 0.90 | 0.62 | 0.75 | 0.44 |
| OLSLogistic | 0.25 | 0.76 | 0.79 | 0.78 | 0.57 | 0.90 | 0.65 | 0.77 | 0.49 |
| OLSLogistic | 0.30 | 0.74 | 0.83 | 0.81 | 0.62 | 0.90 | 0.67 | 0.79 | 0.54 |
| OLSLogistic | 0.35 | 0.73 | 0.86 | 0.83 | 0.66 | 0.90 | 0.68 | 0.79 | 0.56 |
| OLSLogistic | 0.40 | 0.71 | 0.88 | 0.84 | 0.69 | 0.90 | 0.69 | 0.80 | 0.58 |
| OLSLogistic | 0.45 | 0.70 | 0.90 | 0.85 | 0.71 | 0.90 | 0.70 | 0.80 | 0.60 |
| OLSLogistic | 0.50 | 0.70 | 0.91 | 0.86 | 0.74 | 0.89 | 0.71 | 0.80 | 0.62 |
| CART | 0.10 | 0.80 | 0.89 | 0.86 | 0.75 | 0.92 | 0.77 | 0.84 | 0.67 |
| CART | 0.15 | 0.79 | 0.92 | 0.88 | 0.80 | 0.92 | 0.79 | 0.85 | 0.71 |
| CART | 0.20 | 0.79 | 0.92 | 0.88 | 0.80 | 0.92 | 0.79 | 0.85 | 0.71 |
| CART | 0.25 | 0.79 | 0.92 | 0.88 | 0.81 | 0.92 | 0.79 | 0.85 | 0.71 |
| CART | 0.30 | 0.79 | 0.92 | 0.88 | 0.81 | 0.92 | 0.79 | 0.85 | 0.71 |
| CART | 0.35 | 0.78 | 0.92 | 0.88 | 0.81 | 0.91 | 0.79 | 0.85 | 0.71 |
| CART | 0.40 | 0.76 | 0.92 | 0.88 | 0.80 | 0.91 | 0.78 | 0.84 | 0.70 |
| CART | 0.45 | 0.75 | 0.93 | 0.88 | 0.81 | 0.91 | 0.78 | 0.84 | 0.69 |
| CART | 0.50 | 0.75 | 0.93 | 0.88 | 0.82 | 0.91 | 0.78 | 0.84 | 0.70 |
| KNN | 0.10 | 0.86 | 0.40 | 0.53 | 0.34 | 0.90 | 0.49 | 0.63 | 0.18 |
| KNN | 0.15 | 0.83 | 0.58 | 0.64 | 0.41 | 0.90 | 0.55 | 0.70 | 0.31 |
| KNN | 0.20 | 0.81 | 0.66 | 0.70 | 0.46 | 0.91 | 0.58 | 0.73 | 0.37 |
| KNN | 0.25 | 0.80 | 0.69 | 0.72 | 0.48 | 0.91 | 0.60 | 0.74 | 0.40 |
| KNN | 0.30 | 0.77 | 0.72 | 0.73 | 0.50 | 0.90 | 0.60 | 0.75 | 0.42 |
| KNN | 0.35 | 0.77 | 0.74 | 0.74 | 0.51 | 0.90 | 0.61 | 0.75 | 0.44 |
| KNN | 0.40 | 0.75 | 0.77 | 0.76 | 0.54 | 0.90 | 0.63 | 0.76 | 0.46 |
| KNN | 0.45 | 0.73 | 0.80 | 0.78 | 0.56 | 0.89 | 0.63 | 0.76 | 0.48 |
| KNN | 0.50 | 0.70 | 0.85 | 0.81 | 0.62 | 0.89 | 0.65 | 0.77 | 0.52 |
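A table like the one above can be built by sweeping a probability cutoff and recomputing the confusion matrix at each value. The sketch below is a simplified illustration in plain Python, assuming that a case is classified as positive when its predicted probability is at or above the cutoff; the function name `sweep_cutoffs`, the probabilities, and the outcomes are hypothetical and not taken from the report.

```python
def sweep_cutoffs(probabilities, actual, cutoffs):
    """Return (cutoff, sensitivity, specificity) for each probability cutoff."""
    positives = sum(actual)
    negatives = len(actual) - positives
    results = []
    for cutoff in cutoffs:
        predicted = [1 if p >= cutoff else 0 for p in probabilities]
        tp = sum(1 for y, yhat in zip(actual, predicted) if y == 1 and yhat == 1)
        tn = sum(1 for y, yhat in zip(actual, predicted) if y == 0 and yhat == 0)
        results.append((cutoff, tp / positives, tn / negatives))
    return results


# Hypothetical predicted probabilities and outcomes (not from the report).
probabilities = [0.05, 0.12, 0.18, 0.22, 0.35, 0.41, 0.58, 0.73, 0.90, 0.95]
actual = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
cutoffs = [0.10, 0.30, 0.50]

for cutoff, sens, spec in sweep_cutoffs(probabilities, actual, cutoffs):
    print(f"cutoff={cutoff:.2f} sens={sens:.2f} spec={spec:.2f}")
```

As in the table, lowering the cutoff raises sensitivity at the cost of specificity, so the choice of cutoff depends on how costly a missed virologic failure is relative to a false alarm.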
Stacked Ensembling.
Model Discrimination Analysis
Confusion Matrix
Relative Importance
Feature Gain Cover Frequency Importance
1: virologic_failure1 0.293615243 0.012821169 0.005049682 0.293615243
2: vl_count_1_log 0.083067386 0.132833167 0.129337026 0.083067386
3: bmi 0.081313208 0.244015642 0.229027529 0.081313208
4: has_referral_order 0.073463121 0.008024496 0.013194331 0.073463121
5: is_male 0.068524036 0.011444079 0.016940870 0.068524036
6: is_on_health_cover 0.062913742 0.009250477 0.016126405 0.062913742
7: first_age 0.045449906 0.093886460 0.115328229 0.045449906
8: prop_days_on_arvs 0.038685462 0.113080302 0.110604333 0.038685462
9: first_arv_line 0.033575324 0.009918803 0.012542759 0.033575324
10: num_encounters 0.030569616 0.030345682 0.062225118 0.030569616
11: has_high_bp 0.022368534 0.008865405 0.012216973 0.022368534
12: prop_days_on_tb_prop 0.018424878 0.066293104 0.053754683 0.018424878
13: is_urogenital_pexam 0.016909827 0.006929173 0.011728295 0.016909827
14: first_who_stage 0.015576339 0.020660829 0.020361622 0.015576339
15: is_on_contraceptive 0.015135534 0.009007978 0.009773579 0.015135534
16: prop_defaulted_apptmts 0.015061570 0.030476917 0.034207526 0.015061570
17: is_on_tb_prophy_regimen 0.014798505 0.006629841 0.017266656 0.014798505
18: has_phdp_referral 0.014453970 0.005445195 0.006352826 0.014453970
19: has_tb_symptoms 0.012362540 0.006140207 0.007493077 0.012362540
20: needs_fam_tx_support 0.009067414 0.006716438 0.010913830 0.009067414
Appendix
Kappa: similar to accuracy, but it takes into account the agreement that would have happened by chance alone. Here is one possible interpretation of Kappa:
- Poor agreement = Less than 0.20
- Fair agreement = 0.20 to 0.40
- Moderate agreement = 0.40 to 0.60
- Good agreement = 0.60 to 0.80
- Very good agreement = 0.80 to 1.00
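As a worked example of how Kappa corrects accuracy for chance agreement, the sketch below recomputes it for the logistic regression out-of-sample confusion matrix reported earlier (114, 144, 86, 656). It is written in plain Python rather than R, and `cohen_kappa` is a hypothetical helper name.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    return (observed - expected) / (1 - expected)


# Counts from the logistic regression out-of-sample confusion matrix.
kappa = cohen_kappa(tp=114, fp=144, fn=86, tn=656)
print(round(kappa, 4))
```

Here the observed agreement is 0.77 while the agreement expected by chance is about 0.645, which is why a seemingly solid accuracy corresponds to only "fair" agreement on the scale above.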